Fan and Li propose a family of variable selection methods via penalized
likelihood using concave penalty functions. The nonconcave 
penalized likelihood estimators enjoy the oracle properties, but maxi- 
mizing the penalized likelihood function is computationally challeng- 
ing, because the objective function is nondifferentiable and noncon- 
cave. In this article, we propose a new unified algorithm based on the 
local linear approximation (LLA) for maximizing the penalized likeli- 
hood for a broad class of concave penalty functions. Convergence and 
other theoretical properties of the LLA algorithm are established. A 
distinguished feature of the LLA algorithm is that at each LLA step, 
the LLA estimator can naturally adopt a sparse representation. Thus, 
we suggest using the one-step LLA estimator from the LLA algorithm 
as the final estimates. Statistically, we show that if the regularization 
parameter is appropriately chosen, the one-step LLA estimates enjoy 
the oracle properties with good initial estimators. Computationally, 
the one-step LLA estimation methods dramatically reduce the com- 
putational cost in maximizing the nonconcave penalized likelihood. 
We conduct Monte Carlo simulations to assess the finite-sample 
performance of the one-step sparse estimation methods. The results 
are very encouraging. 

1. Introduction. Variable selection and feature extraction are fundamen- 
tal for knowledge discovery and predictive modeling with high-dimensionality 
(Fan and Li [13]). The best subset selection procedure along with traditional 
model selection criteria, such as AIC and BIC, becomes infeasible for feature 
selection from high-dimensional data because of its prohibitive computational 
cost. Furthermore, the best subset selection suffers from several drawbacks, 
the most severe of which is its lack of stability as analyzed in Breiman [4]. 
The LASSO (Tibshirani [32]) utilizes the $L_1$ penalty to automatically 
select significant variables via continuous shrinkage, thus retaining the good 
features of both the best subset selection and ridge regression. In the same 
spirit of LASSO, the penalized likelihood with nonconcave penalty func- 
tions has been proposed to select significant variables for various paramet- 
ric models, including generalized linear regression models and robust linear 
regression model (Fan and Li [10] and Fan and Peng [15]), and some semi- 
parametric models, such as the Cox model and partially linear models (Fan 
and Li [11, 12] and Cai, Fan, Li and Zhou [5]). Fan and Li [10] provide deep 
insights into how to select a penalty function. They further advocate the use 
of penalty functions satisfying certain mathematical conditions such that the 
resulting penalized likelihood estimate possesses the properties of sparsity, 
continuity and unbiasedness. These mathematical conditions imply that the 
penalty function has to be singular at the origin and nonconvex over $(0, \infty)$. 
In the work aforementioned, it has been shown that when the regulariza- 
tion parameter is appropriately chosen, the nonconcave penalized likelihood 
estimates perform as well as the oracle procedure in terms of selecting the 
correct subset model and estimating the true nonzero coefficients. 

Although nonconcave penalized likelihood approaches have promising the- 
oretical properties, the singularity and nonconvexity of the penalty function 
challenge us to invent numerical algorithms that are capable of maximizing 
a nondifferentiable nonconcave function. Fan and Li [10] suggested iteratively 
approximating the penalty function locally by a quadratic function 
and referred to this approximation as the local quadratic approximation (LQA). 
With the aid of the LQA, the optimization of penalized likelihood function 
can be carried out using a modified Newton-Raphson algorithm. However, 
as pointed out in Fan and Li [10] and Hunter and Li [20], the LQA algorithm 
shares a drawback of backward stepwise variable selection: If a covariate is 
deleted at any step in the LQA algorithm, it will necessarily be excluded 
from the final selected model (see Section 2.2 for more details). Hunter and 
Li [20] addressed this issue by optimizing a slightly perturbed version of 
LQA, which alleviates the aforementioned drawback, but it is difficult to 
choose the size of perturbation. Another strategy to overcome the computational 
difficulty is using the one-step (or k-step) estimates from the iterative 
LQA algorithm with good starting estimators, as suggested by Fan and Li 
[10]. This is similar to the well-known one-step estimation argument in the 
maximum likelihood estimation (MLE) setting (Bickel [2], Lehmann and 
Casella [24], Robinson [30] and Cai, Fan, Zhou and Zhou [6]). See also Fan 
and Chen [9] , Fan, Lin and Zhou [14] and Cai et al. [6] for some recent work 
on one-step estimators in local and marginal likelihood models. However, the 
problem with the one-step LQA estimator is that it cannot have a sparse 
representation, thus losing the most attractive and important property of 
the nonconcave penalized likelihood estimator. 

In this article we develop a methodology and theory for constructing an 
efficient one-step sparse estimation procedure in nonconcave penalized like- 
lihood models. For that purpose, we first propose a new iterative algorithm 
based on local linear approximation (LLA) for maximizing the nonconcave 
penalized likelihood. The LLA enjoys three significant advantages over the 
LQA and the perturbed LQA. First, in the LLA we do not have to delete 
any small coefficient or choose the size of perturbation in order to avoid nu- 
merical instability. Second, we demonstrate that the LLA is the best convex 
minorization-maximization (MM) algorithm, thus proving the convergence 
of the LLA algorithm by the ascent property of MM algorithms (Lange, 
Hunter and Yang [23]). Third, the LLA naturally produces sparse estimates 
via continuous penalization. We then propose using the one-step LLA 
estimator from the LLA algorithm as the final estimates. Computationally, 
the one-step LLA estimates alleviate the computation burden in the iterative 
algorithm and overcome the potential local maxima problem in maximizing 
the nonconcave penalized likelihood. In addition, we can take advantage 
of the efficient algorithm for solving LASSO to compute the one-step LLA 
estimator. Statistically, we show that if the regularization parameter is ap- 
propriately chosen, the one-step LLA estimates enjoy the oracle properties, 
provided that the initial estimates are good enough. Therefore, the one-step 
LLA estimator can dramatically reduce the computation cost without losing 
statistical efficiency. 

The rest of the paper is organized as follows. In Section 2 we introduce 
the local linear approximation algorithm and discuss its various properties. 
In Section 3 we discuss the one-step LLA estimator, for which asymptotic 
normality and consistency of selection are established. Section 4 describes 
the implementation detail, and Section 5 shows numerical examples. Proofs 
are presented in Section 6. 

2. Local linear approximation algorithm. Suppose that $\{(x_i, y_i)\}_{i=1}^n$ are 
n independent and identically distributed samples, where $x_i$ denotes the 
p-dimensional predictor and $y_i$ is the response variable. Assume that $y_i$ 
depends on $x_i$ through a linear combination $x_i^T\beta$, and the conditional 
log-likelihood given $x_i$ is $\ell_i(\beta, \phi) = \ell_i(x_i^T\beta, y_i, \phi)$, where $\phi$ is a 
dispersion parameter. In some models, such as logistic regression and Poisson 
regression, there is no dispersion parameter. In the linear regression model, 
$\phi$ is the variance of the random error, and is often estimated separately 
after $\beta$ is estimated. In most variable selection applications, we do not 
penalize the dispersion parameter (Frank and Friedman [16], Tibshirani [32], 
Fan and Li [10] and Miller [28]). Thus, we simplify notation in the remainder 
of this paper by suppressing $\phi$, and further use $\ell_i(\beta)$ to stand for 
$\ell_i(x_i^T\beta, y_i, \phi)$. 

2.1. Penalized likelihood. In the variable selection problem, the assumption 
is that some components of $\beta$ are zero. The goal is to identify and 
estimate the subset model. In this work, we consider the variable selection 
methods by maximizing the penalized likelihood function taking the form 

(2.1) $$Q(\beta) = \sum_{i=1}^n \ell_i(\beta) - n \sum_{j=1}^p p_{\lambda_j}(|\beta_j|).$$

In principle, $p_{\lambda_j}$ can be different for different components (coefficients). For 
ease of presentation, we let $p_{\lambda_j}(|\beta_j|) = p_\lambda(|\beta_j|)$, that is, the same penalty 
function is applied to every component of $\beta$. Formulation in (2.1) includes 




Fig. 1. Plot of local quadratic approximation (thin dotted lines) and local linear 
approximation (thick broken lines) at p = A and 1. (a) and (b) are for the $L_{0.5}$ penalty with $\lambda = 2$, 
and (c) and (d) are for the SCAD penalty with $\lambda = 2$. 
many popular variable selection methods. For instance, the best subset 
selection amounts to using the $L_0$ penalty, while the LASSO (Tibshirani [32]) 
uses the $L_1$ penalty $p_\lambda(|\beta|) = \lambda|\beta|$. Bridge regression (Frank and Friedman 
[16]) uses the $L_q$ penalty $p_\lambda(|\beta|) = \lambda|\beta|^q$. When $0 < q < 1$, the $L_q$ penalty is 
concave over $(0, \infty)$, and nondifferentiable at zero. The SCAD penalty (Fan 
and Li [10]) is a concave function defined by $p_\lambda(0) = 0$ and, for $|\beta| > 0$, 

(2.2) $$p'_\lambda(|\beta|) = \lambda I(|\beta| \le \lambda) + \frac{(a\lambda - |\beta|)_+}{a - 1} I(|\beta| > \lambda) \quad \text{for some } a > 2.$$

Often a = 3.7 is used. The notation $z_+$ stands for the positive part of z: $z_+$ is 
z if z > 0, zero otherwise. The SCAD penalty and $L_{0.5}$ penalty are illustrated 
in Figure 1. Note that with a concave penalty the penalized likelihood in 
(2.1) is a nonconcave function. Hence maximizing nonconcave penalized like- 
lihood is challenging. Antoniadis and Fan [1] proposed nonlinear regularized 
Sobolev interpolators (NRSI) and regularized one-step estimator (ROSE) for 
nonconvex penalized least squares problems under wavelets settings. They 
further introduced the graduated nonconvexity (GNC) algorithm for min- 
imizing high-dimensional nonconvex penalized least squares problem. The 
GNC algorithm was first developed for reconstructing piecewise continuous 
images (Black and Zisserman [3]). The GNC algorithm offers nice ideas for 
minimizing high-dimensional nonconvex objective function, but in general, 
it is computationally intensive, and its implementation depends on a 
sequence of tuning parameters. Fan and Li [10] proposed the local quadratic 
approximation (LQA) algorithm for the nonconcave penalized likelihood. 
We introduce the LQA algorithm in Section 2.2 in detail. Hunter and Li [20] 
showed that the LQA shares the same spirit as that of the MM algorithm 
(Lange et al. [23]). Wu [33] pointed out that the MM algorithm and GNC 
algorithm share the same spirit in terms of optimization transfer. In general, 
the GNC algorithms do not guarantee the ascent property for maximization 
problems, evidenced from Figure 8(c) in Antoniadis and Fan [1], while the 
MM algorithms enjoy the ascent property, as demonstrated in Hunter and 
Li [20]. 
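
As a minimal sketch that is not part of the original paper, the SCAD derivative in (2.2) and the penalty obtained by integrating it from zero can be written as follows. The function names `scad_derivative` and `scad_penalty` are illustrative, and the closed form of the penalty is derived here from (2.2) together with $p_\lambda(0) = 0$.

```python
def scad_derivative(theta, lam, a=3.7):
    """Derivative p'_lambda(|theta|) of the SCAD penalty, following (2.2)."""
    theta = abs(theta)
    if theta <= lam:
        return lam
    # (a*lam - theta)_+ / (a - 1) on the region theta > lam
    return max(a * lam - theta, 0.0) / (a - 1.0)


def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty p_lambda(|theta|), obtained by integrating (2.2) from 0."""
    theta = abs(theta)
    if theta <= lam:
        # linear (LASSO-like) part near the origin
        return lam * theta
    if theta <= a * lam:
        # quadratic transition between lam and a*lam
        return (2.0 * a * lam * theta - theta ** 2 - lam ** 2) / (2.0 * (a - 1.0))
    # constant beyond a*lam, so large coefficients are not shrunk further
    return (a + 1.0) * lam ** 2 / 2.0
```

With $\lambda = 2$ and a = 3.7 the derivative equals 2 on [0, 2], decreases linearly on (2, 7.4] and vanishes afterwards, which is why large coefficients are left unpenalized at the margin.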

2.2. Local quadratic approximation. It can be seen from Figure 1 that 
the penalized likelihood functions become nondifferentiable at the origin 
and nonconcave with respect to $\beta$. The singularity and nonconcavity make 
it difficult to maximize the penalized likelihood functions. Suppose that we 
are given an initial value $\beta^{(0)}$ that is close to the true value of $\beta$. Fan and Li 
[10] propose locally approximating the first order derivative of the penalty 
function by a linear function: 

$$[p_\lambda(|\beta_j|)]' = p'_\lambda(|\beta_j|)\,\mathrm{sign}(\beta_j) \approx \{p'_\lambda(|\beta_j^{(0)}|)/|\beta_j^{(0)}|\}\beta_j.$$

Thus, they use a LQA to the penalty function: 

(2.3) $$p_\lambda(|\beta_j|) \approx p_\lambda(|\beta_j^{(0)}|) + \frac{1}{2}\{p'_\lambda(|\beta_j^{(0)}|)/|\beta_j^{(0)}|\}(\beta_j^2 - \beta_j^{(0)2}) \quad \text{for } \beta_j \approx \beta_j^{(0)}.$$


Figure 1 illustrates the LQA for the $L_{0.5}$ penalty and the SCAD penalty. 
By iteratively updating the LQA, the Newton-Raphson algorithm can be 
modified for maximization of the penalized likelihood function. Specifically, 
we take the unpenalized likelihood estimate to be the initial value $\beta^{(0)}$. For 
k = 0, 1, 2, ..., repeatedly solve 

(2.4) $$\beta^{(k+1)} = \arg\max_\beta \left\{ \sum_{i=1}^n \ell_i(\beta) - n \sum_{j=1}^p \frac{p'_\lambda(|\beta_j^{(k)}|)}{2|\beta_j^{(k)}|} \beta_j^2 \right\}.$$

Stop the iteration if the sequence of $\{\beta^{(k)}\}$ converges. 

To avoid numerical instability, Fan and Li [10] suggested that if $\beta_j^{(k)}$ in 
(2.4) is very close to 0, say $|\beta_j^{(k)}| < \epsilon_0$ (a prespecified value), then set $\hat\beta_j = 0$ 
and delete the jth component of x from the iteration. Thus, the LQA algorithm 
shares a drawback of backward stepwise variable selection: if a covariate 
is deleted at any step in the LQA algorithm, it will necessarily be 
excluded from the final selected model. Furthermore, one has to choose $\epsilon_0$, 
which practically becomes an additional tuning parameter. The size of $\epsilon_0$ 
potentially affects the degree of sparsity of the solution as well as the speed 
of convergence. Hunter and Li [20] studied the convergence property of the 
LQA algorithm. They found that the LQA algorithm is one of minorize- 
maximize (MM) algorithms, extensions of the well-known EM algorithm. 
They further demonstrated that the behavior of the LQA algorithm is the 
same as that of an EM algorithm with the LQA playing the same role of 
E-step in the EM algorithm. To avoid numerical instability and the draw- 
back of backward stepwise variable selection, Hunter and Li [20] suggested 
optimizing a slightly perturbed version of (2.4) that bounds the denominator 
away from zero: for k = 0, 1, 2, ..., repeatedly solve 

(2.5) $$\beta^{(k+1)} = \arg\max_\beta \left\{ \sum_{i=1}^n \ell_i(\beta) - n \sum_{j=1}^p \frac{p'_\lambda(|\beta_j^{(k)}|)}{2\{|\beta_j^{(k)}| + \tau_0\}} \beta_j^2 \right\},$$

for a prespecified size of perturbation $\tau_0$. Stop the iteration if the sequence of 
$\{\beta^{(k)}\}$ converges. In the practical implementation, we have to determine the 
size of perturbation. This sometimes may be difficult, and furthermore, the 
size of $\tau_0$ potentially affects the degree of sparsity of the solution as well as 
the speed of convergence. 
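
To make the behavior of the LQA iteration concrete, here is a small sketch for a one-dimensional problem, maximizing $-\frac{1}{2}(z - \theta)^2 - p_\lambda(|\theta|)$. This scalar setting, the function names `lqa_step` and `run_lqa`, and the choice of an $L_1$ penalty in the example are assumptions made for illustration only. Setting the derivative of the quadratic surrogate to zero gives the ridge-type update $\theta^{(k+1)} = z / \{1 + p'_\lambda(|\theta^{(k)}|)/(|\theta^{(k)}| + \tau_0)\}$, where $\tau_0 = 0$ corresponds to (2.4) and $\tau_0 > 0$ to (2.5).

```python
def lqa_step(theta_k, z, dpen, tau0=0.0):
    """One LQA update for the scalar problem max -(z - t)^2 / 2 - p(|t|).

    dpen is the penalty derivative p'(.); tau0 > 0 gives the perturbed version.
    """
    # With tau0 == 0 this divides by |theta_k|, so theta_k must be nonzero.
    weight = dpen(abs(theta_k)) / (abs(theta_k) + tau0)
    return z / (1.0 + weight)


def run_lqa(z, dpen, n_iter=50, tau0=0.0):
    """Iterate the LQA update, starting from the unpenalized estimate z."""
    theta = z
    for _ in range(n_iter):
        theta = lqa_step(theta, z, dpen, tau0)
    return theta


# Example with the L1 penalty p(t) = 2 t, whose exact solution is the
# soft-thresholding rule sign(z) * max(|z| - 2, 0).
l1_derivative = lambda t: 2.0
shrunk_to_near_zero = run_lqa(1.0, l1_derivative)            # exact solution is 0
converged = run_lqa(3.0, l1_derivative, n_iter=200)          # exact solution is 1
```

In this example the iterates for z = 1 decrease toward zero but every iterate stays strictly positive, which illustrates why the LQA needs the threshold $\epsilon_0$ or the perturbation $\tau_0$ in practice.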
2.3. Local linear approximation. To eliminate the weakness of the LQA, 
we propose a new unified algorithm based on local linear approximation to 
the penalty function: 

(2.6) $$p_\lambda(|\beta_j|) \approx p_\lambda(|\beta_j^{(0)}|) + p'_\lambda(|\beta_j^{(0)}|)(|\beta_j| - |\beta_j^{(0)}|) \quad \text{for } \beta_j \approx \beta_j^{(0)}.$$

Figure 1 illustrates the LLA for the $L_{0.5}$ penalty and the SCAD penalty. 
Fan and Li [10] show that in order to have a continuous thresholding rule, 
the penalty function must satisfy a continuity condition: the minimum of 
$|\theta| + p'_\lambda(|\theta|)$ is attained at zero. Although the $L_{0.5}$ penalty fails to satisfy the 
continuity condition, we show in Section 3 that it is still good for deriving 
continuous one-step sparse estimates. For ease of presentation, we assume 
in this section, unless otherwise specified, that the right derivative of $p_\lambda(\cdot)$ 
at 0 is finite. 

Similar to the LQA algorithm, the maximization of the penalized likelihood 
can be carried out as follows. Set the initial value $\beta^{(0)}$ to be the 
unpenalized maximum likelihood estimate. For k = 0, 1, 2, ..., repeatedly solve 

(2.7) $$\beta^{(k+1)} = \arg\max_\beta \left\{ \sum_{i=1}^n \ell_i(\beta) - n \sum_{j=1}^p p'_\lambda(|\beta_j^{(k)}|)|\beta_j| \right\}.$$

Stop the iterations if the sequence of $\{\beta^{(k)}\}$ converges. We refer to this 
algorithm as the LLA algorithm. The LLA algorithm is distinguished from 
the LQA algorithm in that $\beta^{(k+1)}$ and the final estimates naturally adopt 
a sparse representation. The LLA algorithm inherits the good features of 
LASSO in terms of computational efficiency, and therefore the maximization 
can be solved by efficient algorithms, such as the least angle regression 
(LARS) algorithm (Efron, Hastie, Johnstone and Tibshirani [8]). From (2.7), 
the approximation is numerically stable, and thus, the drawback of backward 
variable selection can be avoided in the LLA algorithm. 
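
The LLA iteration (2.7) can be sketched for penalized least squares, where each step is a weighted LASSO problem. The sketch below uses plain coordinate descent rather than LARS, purely to stay self-contained; the names `soft`, `weighted_lasso` and `lla`, the number of sweeps, and the least squares loss $\frac{1}{2}\|y - X\beta\|^2$ are assumptions made for illustration, not the authors' implementation.

```python
def scad_derivative(theta, lam, a=3.7):
    """SCAD derivative from (2.2)."""
    theta = abs(theta)
    return lam if theta <= lam else max(a * lam - theta, 0.0) / (a - 1.0)


def soft(z, t):
    """Soft-thresholding operator sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def weighted_lasso(X, y, weights, n_sweeps=200):
    """Minimize 0.5*||y - X b||^2 + sum_j weights[j]*|b_j| by coordinate descent.

    Assumes X is a list of rows with no all-zero column.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta for the current beta
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # inner product of column j with the partial residual
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_ss[j] * beta[j]
            new = soft(rho, weights[j]) / col_ss[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def lla(X, y, dpen, beta0, n_steps=1):
    """Run n_steps LLA iterations from beta0; n_steps=1 is the one-step estimator."""
    n = len(X)
    beta = list(beta0)
    for _ in range(n_steps):
        # weights n * p'(|beta_j^(k)|) from the linearized penalty in (2.7)
        weights = [n * dpen(abs(b)) for b in beta]
        beta = weighted_lasso(X, y, weights)
    return beta
```

For an identity design with y = (5, 0.1, -4), the SCAD derivative with $\lambda = 0.5$ and the initial value $\beta^{(0)} = y$, a single step returns (5, 0, -4): the small coefficient is set exactly to zero while the large ones are left unshrunk, since the SCAD derivative vanishes there.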

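We next study the convergence of the LLA algorithm. Denote 

$$\Phi_*(\beta_j|\beta_j^{(k)}) = p_\lambda(|\beta_j^{(k)}|) + p'_\lambda(|\beta_j^{(k)}|)(|\beta_j| - |\beta_j^{(k)}|)$$

and 

$$G(\beta|\beta^{(k)}) = \sum_{i=1}^n \ell_i(\beta) - n \sum_{j=1}^p \Phi_*(\beta_j|\beta_j^{(k)}).$$

Theorem 1. For a differentiable concave penalty function $p_\lambda(\cdot)$ on $[0, \infty)$, 
we have 

(2.8) $$Q(\beta) \ge G(\beta|\beta^{(k)}) \quad \text{and} \quad Q(\beta^{(k)}) = G(\beta^{(k)}|\beta^{(k)}).$$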






H. ZOU AND R. LI 



Furthermore, the LLA has the ascent property, that is, for all k = 0, 1, 2, ..., 

(2.9) $$Q(\beta^{(k+1)}) \ge Q(\beta^{(k)}).$$

If the penalty function is strictly concave, then we always take ">" in (2.9). 

From (2.8), $G(\beta|\beta^{(k)})$ is a minorization of $Q(\beta)$, and finding $\beta^{(k+1)}$ is the 
maximize-step in MM (minorize-maximize) algorithms. Therefore, the LLA 
algorithm is an instance of the MM algorithms. For a survey of work in MM 
algorithms, see Heiser [19] and Lange, Hunter and Yang [23]. 

The analysis of convergence of LLA can be done by following the general 
convergence results for MM algorithms. Let $M(\beta)$ denote the map defined 
by the LLA algorithm from $\beta^{(k)}$ to $\beta^{(k+1)}$. Note that the penalty function 
has continuous first derivative and solving for $\beta^{(k+1)}$ is a convex optimization 
problem, thus M is a continuous map. We define a stationary point of the 
function $Q(\beta)$ to be any point $\beta$ at which the gradient vector is zero. 

Proposition 1. Given an initial value $\beta^{(0)}$, let $\beta^{(k)} = M^k(\beta^{(0)})$. If 
$Q(\beta) = Q(M(\beta))$ only for stationary points of Q and if $\beta^*$ is a limit point 
of the sequence $\{\beta^{(k)}\}$, then $\beta^*$ is a stationary point of $Q(\beta)$. 

Proposition 1 is a slightly modified version of Lyapunov's theorem in 
Lange [22]. We omit its proof. In Theorem 1, we show that the LLA of $p_\lambda(\cdot)$ 
provides a majorization of the penalty function $p_\lambda(\cdot)$. In fact, the LLA is 
the best convex majorization of $p_\lambda(\cdot)$ as stated in the next theorem. 

Theorem 2. Denote by $\psi^*(\cdot)$ the LLA approximation of $p_\lambda(\cdot)$: $\psi^*(t) = 
p_\lambda(t_0) + p'_\lambda(t_0)(t - t_0)$, $t, t_0 \ge 0$. Suppose that $\psi(\cdot)$ is a convex majorization 
function of $p_\lambda(\cdot)$ at $t_0$, that is, 

$$\psi(t_0) = p_\lambda(t_0) \quad \text{and} \quad \psi(t) \ge p_\lambda(t) \text{ for all } t.$$

We must have $\psi(t) \ge \psi^*(t)$ for all t. If the right derivative of $p_\lambda(\cdot)$ at zero 
diverges, the above conclusions hold for $t_0 > 0$ and $t > 0$. 

Figure 1 shows an illustration of Theorem 2 with the SCAD and L0.5 
penalties. As can be seen from Figure 1, the LLA approximation is under- 
neath the LQA approximation in all four cases. 
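
The ordering described above can be checked numerically. The sketch below is an illustration only: it evaluates the bridge penalty $p_\lambda(t) = \lambda|t|^q$ together with its LLA and LQA at an expansion point $t_0 > 0$, and the helper names are hypothetical. The LQA exceeds the LLA by $p'_\lambda(t_0)(|t| - t_0)^2 / (2 t_0) \ge 0$, and the LLA lies above the penalty because the penalty is concave on $[0, \infty)$. The same check applies to any other concave penalty, such as the SCAD.

```python
def bridge_penalty(t, lam, q):
    """Bridge penalty p_lambda(|t|) = lam * |t|**q with 0 < q < 1."""
    return lam * abs(t) ** q


def bridge_derivative(t, lam, q):
    """Derivative p'_lambda(t) = lam * q * t**(q - 1) for t > 0."""
    return lam * q * t ** (q - 1.0)


def lla_approx(t, t0, lam, q):
    """Local linear approximation (2.6) of the penalty at t0 > 0."""
    return bridge_penalty(t0, lam, q) + bridge_derivative(t0, lam, q) * (abs(t) - t0)


def lqa_approx(t, t0, lam, q):
    """Local quadratic approximation (2.3) of the penalty at t0 > 0."""
    slope = bridge_derivative(t0, lam, q) / t0
    return bridge_penalty(t0, lam, q) + 0.5 * slope * (t * t - t0 * t0)


def ordering_holds(t0, lam=2.0, q=0.5, tol=1e-9):
    """Check penalty <= LLA <= LQA on a grid of t values in [-10, 10]."""
    grid = [k / 20.0 for k in range(-200, 201)]
    return all(
        bridge_penalty(t, lam, q) <= lla_approx(t, t0, lam, q) + tol
        and lla_approx(t, t0, lam, q) <= lqa_approx(t, t0, lam, q) + tol
        for t in grid
    )
```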

The ascent property of MM indicates that MM is an extension of the 
famous EM algorithm. Under certain conditions, we show that the LLA 
algorithm can be cast as an EM algorithm. 

Suppose that $\exp(-np_\lambda(\cdot))$ is a Laplace transformation of some nonnegative 
function $H(\cdot)$. Then $H(\cdot)$ is the inverse Laplace transformation of 
$\exp(-np_\lambda(\cdot))$ and 

(2.10) $$\exp(-np_\lambda(|\beta|)) = \int_0^\infty H(t)e^{-t|\beta|}\,dt.$$

For example, if $p_\lambda(|\beta|) = \lambda|\beta|^q$, the Bridge penalty ($0 < q < 1$), then 

$$\exp(-n\lambda|\beta|^q) = \int_0^\infty H(t)e^{-t|\beta|}\,dt,$$

where $H(t) \propto S(t/(n\lambda)^{1/q})$ and $S(\cdot)$ is the density of the stable distribution 
of index q (Mike [27]). 

Let $\pi(t) = \frac{1}{t}H(\frac{1}{t})$ and we independently put a Laplacian prior on $\beta_j$: 

(2.11) $$p(\beta_j|\tau_j) = \frac{1}{2\tau_j} e^{-|\beta_j|/\tau_j}.$$

Further regard $\pi$ as a hyper-prior on $\tau_j$. Then (2.10) implies 

(2.12) $$\exp(-np_\lambda(|\beta_j|)) \propto \int_0^\infty p(\beta_j|\tau_j)\pi(\tau_j)\,d\tau_j.$$

Maximizing $Q(\beta)$ is equivalent to computing the posterior mode of $p(\beta|y)$, 
if we treat $\exp(-np_\lambda(|\beta_j|))$ as the marginal prior of $\beta$. The identity (2.12) 
implies an EM algorithm for maximizing the posterior $p(\beta|y)$. 

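To derive the EM algorithm, we consider $\tau_1, \ldots, \tau_p$ as missing data. The 
complete log-likelihood function (CLF) is 

$$\sum_{i=1}^n \ell_i(\beta) + \sum_{j=1}^p \left[ -\log(2\tau_j) - \frac{|\beta_j|}{\tau_j} + \log\pi(\tau_j) \right].$$

Suppose the current estimator is $\beta^{(k)}$. The E-step computes the conditional 
mean of CLF 

$$\sum_{i=1}^n \ell_i(\beta) - \sum_{j=1}^p |\beta_j| E\left[\frac{1}{\tau_j} \,\Big|\, \beta^{(k)}, y\right] + \sum_{j=1}^p E\left[ -\log(2\tau_j) + \log\pi(\tau_j) \,\Big|\, \beta^{(k)}, y \right].$$

The M-step finds $\beta^{(k+1)}$ maximizing $E[\mathrm{CLF}]$. Thus 

(2.13) $$\beta^{(k+1)} = \arg\max_\beta \left\{ \sum_{i=1}^n \ell_i(\beta) - \sum_{j=1}^p |\beta_j| E\left[\frac{1}{\tau_j} \,\Big|\, \beta^{(k)}, y\right] \right\}.$$
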

Theorem 3. Suppose that (2.10)-(2.13) hold for $p_\lambda(\cdot)$. Then the LLA 
algorithm and the EM algorithm are identical. Moreover, (2.10) implies that 
$p_\lambda(\cdot)$ must be a strictly increasing function on $[0, \infty)$ and unbounded. Thus 
the SCAD penalty does not have an inverse Laplace transformation. 

In the above discussion, we have assumed all the necessary conditions to 
ensure that the EM algorithm is proper. If this is the case, then Theorem 3 
shows that the EM algorithm is exactly the LLA algorithm. On the other 
hand, it is also worth noting that there are concave penalty functions for 
which (2.10) cannot be true. The SCAD penalty is such an example. Thus, 
Theorem 3 also indicates that MM algorithms are more flexible than EM 
algorithms. 

3. One-step sparse estimates. In this section, we propose the one-step 
LLA estimator, which is significantly distinguished from the one-step or k- 
step LQA estimate because it automatically adopts a sparse representation. 
Thus it can be used as a model selector. One may further define the k-step 
LLA estimator, but, in general, it is unnecessary. As demonstrated in Fan 
and Chen [9] and Cai, Fan and Li [7], both empirically and theoretically, 
the one-step method is as efficient as the fully iterative method, provided 
that the initial estimators are reasonably good. In LQA finding $\beta^{(k+1)}$ is 
a ridge regression problem, which indicates that almost surely, none of the 
components of $\beta^{(k+1)}$ will be exactly zero. Hence the one-step or k-step LQA 
estimates will not be able to achieve the goal of variable selection. 
To get insights into the one-step LLA estimator, let us start with linear 
regression models and consider the penalized least squares. 

3.1. Linear regression models. The LLA algorithm naturally provides a 
sparse one-step estimator. For simplicity, let the initial estimate $\beta^{(0)}$ be the 
ordinary least squares estimator. Then the one-step estimator is obtained 
by 

(3.1) $$\beta^{(1)} = \arg\min_\beta \frac{1}{2}\|y - X\beta\|^2 + n \sum_{j=1}^p p'_\lambda(|\beta_j^{(0)}|)|\beta_j|.$$

We denote by $\hat\beta(\mathrm{ose})$ the one-step estimator $\beta^{(1)}$. 

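We show that the one-step estimator enjoys the oracle properties. To this 
end we assume two regularity conditions: 

(A1). $y_i = x_i^T\beta + \epsilon_i$, where $\epsilon_1, \ldots, \epsilon_n$ are independent and identically 
distributed random variables with mean 0 and variance $\sigma^2$, 
(A2). $\frac{1}{n}X^TX \to C$ where C is a positive definite matrix. 

Without loss of generality, let $\beta_0 = (\beta_{01}, \ldots, \beta_{0p})^T = (\beta_{10}^T, \beta_{20}^T)^T$ and $\beta_{20} = 0$. 
We write C in block form, partitioned conformably with $(\beta_{10}^T, \beta_{20}^T)^T$: 

$$C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}.$$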




Theorem 4. Let $p_{\lambda_n}(\cdot)$ be the SCAD penalty. If $\sqrt{n}\lambda_n \to \infty$ and $\lambda_n \to 0$, 
then the one-step SCAD estimates $\hat\beta(\mathrm{ose})$ must satisfy: 

(a) Sparsity: with probability tending to one, $\hat\beta(\mathrm{ose})_2 = 0$. 

(b) Asymptotic normality: $\sqrt{n}(\hat\beta(\mathrm{ose})_1 - \beta_{10}) \to N(0, \sigma^2 C_{11}^{-1})$. 

In addition, consider $p_{\lambda_n}(\cdot) = \lambda_n p(\cdot)$. Suppose $p'(\cdot)$ is continuous on $(0, \infty)$ 
and there is some $s > 0$ such that $p'(\theta) = O(\theta^{-s})$ as $\theta \to 0+$. Then (a) and 
(b) hold, if $n^{(1+s)/2}\lambda_n \to \infty$ and $\sqrt{n}\lambda_n \to 0$. 

3.2. Penalized likelihood. For a general likelihood model, let $\ell(\beta) = 
\sum_{i=1}^n \ell_i(\beta)$ denote the log-likelihood. Suppose that the log-likelihood 
function is smooth and has the first two derivatives with respect to $\beta$. For a given 
initial value $\beta^{(0)}$, the log-likelihood function can be locally approximated by 

(3.2) $$\ell(\beta) \approx \ell(\beta^{(0)}) + \nabla\ell(\beta^{(0)})^T(\beta - \beta^{(0)}) + \frac{1}{2}(\beta - \beta^{(0)})^T \nabla^2\ell(\beta^{(0)})(\beta - \beta^{(0)}).$$

Let us take $\beta^{(0)} = \hat\beta(\mathrm{mle})$. Then $\nabla\ell(\beta^{(0)}) = 0$ by the definition of MLE. 
Thus, $\beta^{(1)}$ is given by 

(3.3) $$\beta^{(1)} = \arg\min_\beta \frac{1}{2}(\beta - \beta^{(0)})^T[-\nabla^2\ell(\beta^{(0)})](\beta - \beta^{(0)}) + n \sum_{j=1}^p p'_\lambda(|\beta_j^{(0)}|)|\beta_j|.$$

It is interesting to see that (3.3) reduces to the one-step estimates in linear 
regression models, if we are willing to assume that $\epsilon \sim N(0, \sigma^2)$. However, it 
should be noted that the normality assumption is not needed in Theorem 4. 

We show that in the general likelihood setting, $\beta^{(1)}$ is the desired one-step 
estimate, denoted by $\hat\beta(\mathrm{ose})$. Let $I(\beta_0)$ be the Fisher information matrix 
and $I_1(\beta_{10}) = I_1(\beta_{10}, 0)$ denote the Fisher information knowing $\beta_{20} = 0$. 
Note that $I(\beta_0)$ is a p × p matrix and $I_1(\beta_{10})$ is a submatrix of $I(\beta_0)$. It 
is well known that under some regularity conditions (Lehmann and Casella 
[24]), $n^{-1}\nabla^2\ell(\hat\beta(\mathrm{mle})) \to_P -I(\beta_0)$, and 

$$\sqrt{n}(\beta_0 - \hat\beta(\mathrm{mle})) \to_D W = N(0, I^{-1}(\beta_0)).$$

Theorem 5. Let $p_{\lambda_n}(\cdot)$ be the SCAD penalty. If $\sqrt{n}\lambda_n \to \infty$ and $\lambda_n \to 0$, 
then the one-step SCAD estimates $\hat\beta(\mathrm{ose})$ must satisfy: 

(a) Sparsity: with probability tending to one, $\hat\beta(\mathrm{ose})_2 = 0$. 

(b) Asymptotic normality: $\sqrt{n}(\hat\beta(\mathrm{ose})_1 - \beta_{10}) \to N(0, I_1^{-1}(\beta_{10}))$. 

In addition, consider $p_{\lambda_n}(\cdot) = \lambda_n p(\cdot)$. Suppose $p'(\cdot)$ is continuous on $(0, \infty)$ 
and there is some $s > 0$ such that $p'(\theta) = O(\theta^{-s})$ as $\theta \to 0+$. Then (a) and 
(b) hold, if $n^{(1+s)/2}\lambda_n \to \infty$ and $\sqrt{n}\lambda_n \to 0$. 
In Theorems 4 and 5, we have established the oracle properties of the 
one-step SCAD estimator. It is interesting to note that the choice of $\lambda_n$ is 
the same as that in Theorem 2 of Fan and Li [10]. It is also worth noting 
that our results require fewer regularity conditions than Theorem 2 of Fan 
and Li [10], for the penalty function does not need to be twice differentiable. 

3.3. Continuity of the one-step estimator. For the nonconcave penalized 
likelihood estimates to be continuous, the minimum of the function $|\theta| + 
p'_\lambda(|\theta|)$ must be attained at 0 (Fan and Li [10]). The Bridge penalty ($0 < q < 1$) 
fails to satisfy the continuity condition, thus it is considered suboptimal (Fan 
and Li [10]). Our results require weaker conditions to ensure a continuous 
thresholding estimator. Note that $\hat\beta(\mathrm{ose})$ is obtained through an $\ell_1$ penalized 
criterion. Therefore, we only require that $p'_\lambda(|\theta|)$ be continuous for $|\theta| > 0$ to ensure 
the continuity of $\hat\beta(\mathrm{ose})$. Theorem 4 and Theorem 5 indicate that the Bridge 
penalty, $p_\lambda(|\theta|) = \lambda|\theta|^q$ for $0 < q < 1$, can be used in the one-step estimation 
scheme and its one-step estimates are continuous. 

There is another interesting implication of the continuity of $\hat\beta(\mathrm{ose})$. Suppose 
two penalty functions have very similar derivatives; then we expect 
their one-step estimators to be very close, too. To illustrate this point, we 
consider the limiting one-step estimator with the $L_q$ penalty when $q \to 0+$: 

$$\beta_q^{(1)} = \arg\min_\beta \frac{1}{2}(\beta - \beta^{(0)})^T[-\nabla^2\ell(\beta^{(0)})](\beta - \beta^{(0)}) + n \sum_{j=1}^p \lambda q |\beta_j^{(0)}|^{q-1}|\beta_j|.$$

For each fixed q, we are interested in the whole profile of $\beta_q^{(1)}$ as a function 
of $\lambda$. Thus we can consider $\lambda^* = \lambda q$ as the effective regularization parameter. 
On the other hand, suppose we consider the one-step estimator with the 
logarithm penalty, $p_\lambda(|\beta|) = \lambda \log|\beta|$, 

$$\beta_{\log}^{(1)} = \arg\min_\beta \frac{1}{2}(\beta - \beta^{(0)})^T[-\nabla^2\ell(\beta^{(0)})](\beta - \beta^{(0)}) + n \sum_{j=1}^p \lambda |\beta_j^{(0)}|^{-1}|\beta_j|.$$

Proposition 2. If $q \to 0+$, then the profile of $\beta_q^{(1)}$ converges to the 
profile of $\beta_{\log}^{(1)}$ in the sense that $\lim_{q \to 0+} \beta_q^{(1)}(\lambda/q) = \beta_{\log}^{(1)}(\lambda)$, $\forall \lambda > 0$. 

We make a note that the convexity of the LLA is crucial for Proposition 2. 
We demonstrate the continuity property of the one-step estimator in linear 
regression models with an orthogonal design. As can be seen from Figure 2, in 
orthogonal design the $L_{0.01}$ penalty and the logarithm penalty are equivalent 
to some discontinuous thresholding rules, but their one-step estimators yield 
continuous thresholding rules. Moreover, the one-step $L_{0.01}$ estimator with 
$\lambda = 200$ is very similar to the one-step logarithm estimator with $\lambda = 2$, which 
gives an illustrative example of Proposition 2. We also show the SCAD 
thresholding and its one-step version in Figure 2. They are both continuous 
and unbiased for large coefficients, but they are not identical. 
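
The comparison between the one-step $L_{0.01}$ and logarithm rules can be sketched numerically. The setting below is an assumption made for illustration: a scalar problem, minimizing $\frac{1}{2}(z - \theta)^2 + p'_\lambda(|z|)|\theta|$ with the initial estimate z, for which the one-step estimator is the soft-thresholding rule $\mathrm{sign}(z)(|z| - p'_\lambda(|z|))_+$. The function names are hypothetical and the scaling of the design is absorbed into $\lambda$.

```python
def one_step_threshold(z, dpen):
    """One-step LLA rule for min 0.5*(z - t)**2 + p'(|z|)*|t|, initial value z."""
    if z == 0.0:
        # p'(0+) may be infinite (bridge, logarithm); the estimate is then 0
        return 0.0
    shrunk = max(abs(z) - dpen(abs(z)), 0.0)
    return shrunk if z > 0 else -shrunk


def bridge_derivative(lam, q):
    """Return t -> lam * q * t**(q - 1), the derivative of lam * t**q for t > 0."""
    return lambda t: lam * q * t ** (q - 1.0)


def log_derivative(lam):
    """Return t -> lam / t, the derivative of lam * log(t) for t > 0."""
    return lambda t: lam / t


# Settings quoted in the text: L_0.01 with lambda = 200, logarithm with lambda = 2.
lq_rule = bridge_derivative(200.0, 0.01)
log_rule = log_derivative(2.0)
grid = [k / 100.0 for k in range(1, 1001)]
max_gap = max(
    abs(one_step_threshold(z, lq_rule) - one_step_threshold(z, log_rule))
    for z in grid
)
```

On this grid the two rules differ by less than 0.01, and the logarithm rule, $z - 2/z$ for $z > \sqrt{2}$ and 0 otherwise, has no jump at its threshold.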

4. Implementation. In this section we show that the LLA allows an efficient 
implementation of the one-step sparse estimator. The key is to notice 
that solving (3.3) is not much different from solving LASSO. Standard 
quadratic programming software can be used to solve LASSO. The shooting 
algorithm also works well (Fu [17] and Yuan and Lin [35]). Efron et al. [8] 
proposed an efficient path algorithm called LARS for computing the entire 
solution path of LASSO. See also the homotopy algorithm by Osborne, Pres- 
nell and Turlach [29]. The LARS algorithm is a major breakthrough in the 
development of the LASSO-type methods. Zou and Hastie [36] modified the 
LARS algorithm to compute the solution paths of the elastic net. Rosset 
and Zhu [31] generalized the LARS type algorithm to a class of optimiza- 
tion problems with a LASSO penalty. The LARS algorithm was used to 
simplify the computations in an empirical Bayes model for LASSO (Yuan 
and Lin [34]). 

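We adopt the LARS idea in our implementation. Observe that 

$$-\nabla^2\ell(\beta^{(0)}) = X^TDX,$$

where D is an n × n diagonal matrix whose entries $D_{ii}$, i = 1, 2, ..., n, are 
determined by the second derivative of $\ell_i$ with respect to $x_i^T\beta$ at $\beta^{(0)}$. 
In linear regression models, $D_{ii} = 2$. We separately discuss the algorithm for 
two types of concave penalties. 

Type 1. $p_\lambda(t) = \lambda p(t)$ and $p'(t) > 0$ for all t. Bridge penalties and the 
logarithm penalty belong to this category, which also covers many other 
penalties. We propose the following algorithm to compute the one-step 
estimator. 

Algorithm 1. 

Step 1. Create working data by 

$$y_i^* = \sqrt{D_{ii}}\, x_i^T\beta^{(0)}, \quad x_{ij}^* = \sqrt{D_{ii}}\, x_{ij} / p'(|\beta_j^{(0)}|), \quad i = 1, 2, \ldots, n;\ j = 1, \ldots, p.$$

Step 2. Apply the LARS algorithm to solve 

$$\hat\beta^* = \arg\min_\beta \left\{ \frac{1}{2}\|y^* - X^*\beta\|^2 + n\lambda\|\beta\|_1 \right\}.$$
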
Fig. 2. Compare thresholding rules in orthogonal design. (a) and (b) are for the logarithm 
penalty and its one-step LLA approximation, $\lambda = 2$. (c) and (d) are for Bridge ($L_{0.01}$) and 
its one-step LLA approximation, $\lambda = 200$. (e) and (f) are for SCAD and its one-step LLA 
approximation, $\lambda = 2$. 



ONE-STEP SPARSE ESTIMATES 






Then, it is not hard to show that 

$$\beta_j^{(1)} = \hat\beta_j^* / p'(|\beta_j^{(0)}|), \quad j = 1, 2, \ldots, p.$$

Thus, if $\hat\beta_j^* \ne 0$, then $x_j$ is selected in the final model. 

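Type 2. For some penalties, the derivative can be zero. In addition, the 
regularization parameter $\lambda$ cannot be separated from the penalty function. 
The SCAD penalty is a typical example. Let us assume that 

$$U = \{j : p'_\lambda(|\beta_j^{(0)}|) = 0\} \quad \text{and} \quad V = \{j : p'_\lambda(|\beta_j^{(0)}|) > 0\}.$$

We write 

$$X = [X_U, X_V] \quad \text{and} \quad \beta^{(1)} = (\beta_U^{(1)T}, \beta_V^{(1)T})^T.$$

We propose the following algorithm to compute $\beta^{(1)}$. 

Algorithm 2. 

Step 1a. Create working data by $y_i^* = \sqrt{D_{ii}}\, x_i^T\beta^{(0)}$, $x_i^* = \sqrt{D_{ii}}\, x_i$, i = 1, ..., n. 
Step 1b. Let $x_j^* = x_j^* \frac{\lambda}{p'_\lambda(|\beta_j^{(0)}|)}$ for $j \in V$. 

Step 1c. Let $H_U$ be the projection matrix in the space of $\{x_j^*, j \in U\}$. 
Compute $y^{**} = y^* - H_U y^*$ and $X_V^{**} = X_V^* - H_U X_V^*$. 
Step 2. Apply the LARS algorithm to solve 

$$\hat\beta_V^* = \arg\min_\beta \left\{ \frac{1}{2}\|y^{**} - X_V^{**}\beta\|^2 + n\lambda\|\beta\|_1 \right\}.$$

Step 3. Compute $\hat\beta_U^* = (X_U^{*T} X_U^*)^{-1} X_U^{*T}(y^* - X_V^* \hat\beta_V^*)$. 
Then, it is not hard to show that 

$$\beta_U^{(1)} = \hat\beta_U^* \quad \text{and} \quad \beta_j^{(1)} = \hat\beta_j^* \frac{\lambda}{p'_\lambda(|\beta_j^{(0)}|)} \quad \text{for } j \in V.$$

Thus, if $\hat\beta_j^* \ne 0$, then $x_j$ is selected in the final model for $j \in V$. 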

In both algorithms the LARS step uses the same order of computations 
of a single OLS fit (Efron et al. [8]). Thus it is very efficient to compute 
the one-step estimator. It is also remarkable that if the penalty is of type 
1, then the entire profile of the one-step estimator (as a function of A) can 
be efficiently constructed. For the SCAD type penalty, we still need to solve 
the one-step estimator for each fixed A, for the sets U and V could change 
as A varies. 
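
The rescaling idea behind the Type 1 algorithm can be sketched for least squares as follows. This is an illustration under stated assumptions, not the authors' code: the response y is used directly in the quadratic term as in (3.1), a simple coordinate descent solver stands in for LARS, and the names `lasso_cd` and `one_step_type1` are hypothetical. Dividing column j by $p'(|\beta_j^{(0)}|)$ turns the weighted penalty into a plain $L_1$ penalty with parameter $n\lambda$, and the solution is mapped back by the same factor.

```python
def soft(z, t):
    """Soft-thresholding operator sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, penalty, n_sweeps=500):
    """Minimize 0.5*||y - X b||^2 + penalty*||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_ss[j] * beta[j]
            new = soft(rho, penalty) / col_ss[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def one_step_type1(X, y, beta0, dpen, lam):
    """One-step estimator for a penalty lam * p(t) with p'(t) > 0.

    dpen is p'(.); every entry of beta0 must give a finite, positive p'.
    """
    n, p = len(X), len(X[0])
    scale = [1.0 / dpen(abs(b)) for b in beta0]
    # Step 1: rescale each column by 1 / p'(|beta0_j|)
    X_star = [[X[i][j] * scale[j] for j in range(p)] for i in range(n)]
    # Step 2: ordinary LASSO on the working data with parameter n * lam
    b_star = lasso_cd(X_star, y, n * lam)
    # Map back: beta_j = b_star_j / p'(|beta0_j|)
    return [b_star[j] * scale[j] for j in range(p)]
```

For an identity design with y = (3, 0.5, -2), $\beta^{(0)} = y$, the logarithm penalty ($p'(t) = 1/t$) and $\lambda = 0.2$, this returns approximately (2.8, 0, -1.7), which matches soft-thresholding each $y_j$ at $n\lambda/|y_j|$.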
5. Numerical examples. In this section we assess the finite sample performance 
of the one-step sparse estimates for linear regression models, logistic 
regression models and Poisson regression models in terms of model 
complexity (sparsity) and model error, defined by 

$$\mathrm{ME}\{\hat\mu(\cdot)\} = E\{\hat\mu(x) - \mu(x)\}^2$$

for a selected model $\hat\mu(\cdot)$, where the expectation is taken over the new 
observation x. We compare their performance with that of the SCAD with the 
original LQA algorithm (Fan and Li [10]) and the perturbed LQA algorithm 
(Hunter and Li [20]), and the best subset variable selection with the AIC 
and BIC. For a fitted subset model $\mathcal{M}$, the AIC and BIC statistics are of 
the form 

$$2\log(\text{likelihood}) - \lambda|\mathcal{M}|,$$

where $|\mathcal{M}|$ is the size of the model and $\lambda = 2$ and $\log(n)$, respectively. Note 
that the BIC is a consistent model selection criterion, while AIC is not. We 
further demonstrate the proposed methodology by analysis of a real data 
set. 
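To make the criterion concrete, here is a minimal sketch of exhaustive best subset selection that maximizes $2\log(\text{likelihood}) - \lambda|\mathcal{M}|$. It assumes a Gaussian linear model without intercept and uses the profile log-likelihood with constants dropped, so that $2\log(\text{likelihood})$ becomes $-n\log(\mathrm{RSS}/n)$; the function name and these modeling choices are illustrative assumptions, not details taken from the simulations.

```python
import itertools

import numpy as np


def best_subset(X, y, penalty):
    """Exhaustive search maximizing 2*loglik - penalty*|M| for a Gaussian
    linear model without intercept (profile log-likelihood, constants dropped).
    Use penalty=2 for AIC and penalty=log(n) for BIC."""
    n, p = X.shape
    best_score, best_model = -np.inf, ()
    for size in range(p + 1):
        for model in itertools.combinations(range(p), size):
            if size == 0:
                rss = float(y @ y)
            else:
                Xm = X[:, list(model)]
                coef = np.linalg.lstsq(Xm, y, rcond=None)[0]
                rss = float(np.sum((y - Xm @ coef) ** 2))
            score = -n * np.log(rss / n) - penalty * size
            if score > best_score:
                best_score, best_model = score, model
    return best_model, best_score
```

With 12 covariates this loop visits $2^{12} = 4096$ subsets, which is feasible here but grows exponentially with the number of covariates.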

In our simulation studies, we examine the performance of one-step sparse
estimates with the SCAD penalty, the logarithm penalty (defined in Section
3.3) and the $L_{0.01}$ penalty. Note that we expect the logarithm penalty
and the $L_{0.01}$ penalty to generate similar one-step sparse estimators.
In Tables 1-3, one-step SCAD, one-step LOG and one-step $L_{0.01}$ stand for
the one-step sparse estimate with the SCAD, logarithm and $L_{0.01}$
penalty, respectively; SCAD and P-SCAD represent the penalized least squares
or likelihood estimators with the SCAD penalty using the LQA and perturbed
LQA algorithms, respectively; and AIC and BIC are the best subset variable
selection with the AIC and BIC criteria, respectively. For the best subset
variable selection, we exhaustively searched over all possible subsets. We
used five-fold cross-validation to select the tuning parameters.

Example 1 (Linear model). In this example, simulation data were generated
from the linear regression model,

\[ y = x^T \beta + \varepsilon, \]

where $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0)^T$,
$\varepsilon \sim N(0, 1)$ and $x$ follows a multivariate normal
distribution with zero mean and covariance between the $i$th and $j$th
elements being $\rho^{|i-j|}$ with $\rho = 0.5$. In our simulation, the
sample size $n$ is set to be 50 and 100. For each case, we repeated the
simulation 1,000 times.
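The data-generating mechanism of Example 1 can be sketched as follows. This is only an illustration of the stated design; the function name, the seed handling and the returned tuple are assumptions.

```python
import numpy as np


def simulate_linear(n, rho=0.5, seed=None):
    """Draw one data set from the Example 1 design."""
    rng = np.random.default_rng(seed)
    beta = np.array([3, 1.5, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0], dtype=float)
    p = beta.size
    idx = np.arange(p)
    # covariance between the i-th and j-th covariates is rho^{|i-j|}
    sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), sigma, size=n)
    y = X @ beta + rng.standard_normal(n)   # standard normal errors
    return X, y, beta, sigma
```

Calling `simulate_linear(50, seed=0)` and `simulate_linear(100, seed=0)` gives one replication for each of the two sample sizes.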

For the linear model, the model error for $\hat{\mu} = x^T \hat{\beta}$ is
$\mathrm{ME}(\hat{\mu}) = (\hat{\beta} - \beta)^T E(x x^T)(\hat{\beta} - \beta)$.
Simulation results are summarized in Table 1, in which MRME stands for the
median of ratios of ME of a selected model to that of the ordinary least
squares estimate under the full model. The columns "C" and "IC" are both
measures of model complexity. Column "C" shows the average number of
nonzero coefficients correctly estimated to be nonzero, and column "IC"
presents the average number of zero coefficients incorrectly estimated to be
nonzero. In the column labeled "Under-fit," we present the proportion of the
1,000 replications in which any nonzero coefficient was excluded. Likewise,
we report the probability of selecting the exact subset model and the
probability of including all three significant variables and some noise
variables in the columns "Correct-fit" and "Over-fit," respectively.
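The summary measures just described can be computed along the following lines. This is a small sketch with hypothetical helper names; it assumes that the covariance matrix $E(xx^T)$ is known, as it is in the simulation design.

```python
import numpy as np


def model_error(beta_hat, beta, sigma):
    """ME = (beta_hat - beta)^T E(x x^T) (beta_hat - beta) for the linear model."""
    d = np.asarray(beta_hat, dtype=float) - np.asarray(beta, dtype=float)
    return float(d @ sigma @ d)


def selection_summary(beta_hat, beta):
    """Return (C, IC, fit label) for one replication."""
    selected = np.asarray(beta_hat) != 0
    truth = np.asarray(beta) != 0
    c = int(np.sum(selected & truth))     # true nonzeros estimated as nonzero
    ic = int(np.sum(selected & ~truth))   # true zeros estimated as nonzero
    if c < int(truth.sum()):
        fit = "under-fit"                 # some nonzero coefficient excluded
    elif ic == 0:
        fit = "correct-fit"               # exact subset selected
    else:
        fit = "over-fit"                  # all signals plus some noise variables
    return c, ic, fit


def mrme(me_selected, me_full):
    """Median over replications of ME(selected model) / ME(full-model fit)."""
    return float(np.median(np.asarray(me_selected) / np.asarray(me_full)))
```

Averaging `c` and `ic` over replications gives the "C" and "IC" columns, and the relative frequencies of the three labels give the last three columns.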

As can be seen from Table 1, all variable selection procedures dramatically
reduce model error. One-step SCAD has the smallest model error among all
competitors, followed by the SCAD and perturbed-SCAD. In terms of model
error, penalized least squares methods with concave penalties outperform the
best subset selection. In terms of sparsity, one-step SCAD also has the
highest probability of correct fit. The SCAD penalty performs better than
the other penalties. One-step LOG and one-step $L_{0.01}$ perform very
similarly, which numerically confirms the assertion in Proposition 2. It is
also interesting to note that a simulation study by Leng, Lin and Wahba [25]
showed that in this example the LASSO did not consistently select the true
model when optimizing the prediction error. In contrast, the nonconcave
penalty methods and their one-step estimates all work very well in this
example because of their oracle properties.

Table 1
Simulation results for linear regression models

                                 No. of Zeros       Proportion of
Method                 MRME      C       IC      Under-fit  Correct-fit  Over-fit

n = 50
One-step SCAD          0.208     3.00    0.55    0.000      0.771        0.229
One-step LOG           0.263     3.00    0.89    0.000      0.559        0.441
One-step $L_{0.01}$    0.262     3.00    0.90    0.000      0.555        0.445
SCAD                   0.233     3.00    0.83    0.000      0.682        0.318
P-SCAD                 0.235     3.00    0.64    0.000      0.701        0.299
AIC                    0.660     3.00    1.84    0.000      0.195        0.805
BIC                    0.401     3.00    0.63    0.000      0.576        0.424

n = 100
One-step SCAD          0.234     3.00    0.55    0.000      0.784        0.216
One-step LOG           0.281     3.00    0.71    0.000      0.657        0.343
One-step $L_{0.01}$    0.281     3.00    0.71    0.000      0.657        0.343
SCAD                   0.252     3.00    0.75    0.000      0.732        0.268
P-SCAD                 0.262     3.00    0.63    0.000      0.711        0.289
AIC                    0.676     3.00    1.63    0.000      0.192        0.808
BIC                    0.337     3.00    0.32    0.000      0.728        0.272

Example 2 (Logistic regression). In this example, we simulated 1,000
data sets consisting of $n = 200$ observations from the model

\[ Y \mid x \sim \text{Bernoulli}\{p(x^T \beta)\}, \]

where $p(u) = \exp(u)/(1 + \exp(u))$, and $\beta$ is the same as that in
Example 1. The covariate vector $x$ is created as follows. We first generate
$z$ from a 12-dimensional multivariate normal distribution with zero mean
and covariance between the $i$th and $j$th elements being $\rho^{|i-j|}$
with $\rho = 0.5$. Then we set $x_{2k-1} = z_{2k-1}$ and
$x_{2k} = I(z_{2k} < 0)$ for $k = 1, \ldots, 6$, where $I(\cdot)$ is an
indicator function. Thus, $x$ has continuous as well as binary components.

Unlike the model error for linear regression models, there is no closed-form
expression for the model error of the logistic regression model. In this
example, the model error was estimated using Monte Carlo simulation.
Simulation results are summarized in Table 2, in which MRME stands for the
median of ratios of ME of a selected model to that of the unpenalized
maximum likelihood estimate under the full model, and other notation is the
same as that in Table 1.
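The covariate construction of Example 2 and a Monte Carlo estimate of the logistic model error $E\{p(x^T\hat{\beta}) - p(x^T\beta)\}^2$ might be sketched as below. The function names, the Monte Carlo sample size and the seed are assumptions; the text does not state how many draws were used.

```python
import numpy as np


def example2_covariates(n, rho=0.5, rng=None):
    """Covariates with continuous odd-numbered and binary even-numbered columns."""
    rng = np.random.default_rng() if rng is None else rng
    idx = np.arange(12)
    sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    z = rng.multivariate_normal(np.zeros(12), sigma, size=n)
    x = z.copy()
    # 1-based even positions x_{2k} are the 0-based columns 1, 3, ..., 11
    x[:, 1::2] = z[:, 1::2] < 0
    return x


def logistic_model_error(beta_hat, beta, n_mc=100_000, seed=0):
    """Monte Carlo estimate of E{p(x'beta_hat) - p(x'beta)}^2 over new x."""
    x = example2_covariates(n_mc, rng=np.random.default_rng(seed))
    p_hat = 1.0 / (1.0 + np.exp(-(x @ beta_hat)))
    p_true = 1.0 / (1.0 + np.exp(-(x @ beta)))
    return float(np.mean((p_hat - p_true) ** 2))
```

Using the same seed for every fitted model evaluates all competitors on the same simulated covariates, which keeps ratios of model errors from being dominated by Monte Carlo noise.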

From Table 2, it can be seen that the best subset variable selection with
the BIC criterion performs the best; however, the computational cost of the
best subset variable selection is much higher than that of the nonconcave
penalized likelihood approach. One-step sparse estimates require the least
computational cost. It is interesting to see from Table 2 that the one-step
SCAD performs as well as the fully iterative SCAD estimates by the LQA and
perturbed LQA algorithms in terms of model error. The one-step estimates
with logarithm and $L_{0.01}$ penalties perform very well. They have lower
model error and a lower rate of under-fit models than the ones with the SCAD
penalty.

Table 2
Simulation results for logistic regression model

                                 No. of Zeros       Proportion of
Method                 MRME      C       IC      Under-fit  Correct-fit  Over-fit

One-step SCAD          0.238     2.95    0.82    0.051      0.565        0.384
One-step LOG           0.229     2.97    0.61    0.029      0.518        0.453
One-step $L_{0.01}$    0.230     2.97    0.61    0.028      0.516        0.456
SCAD                   0.238     2.92    0.51    0.076      0.706        0.218
P-SCAD                 0.237     2.92    0.50    0.079      0.707        0.214
AIC                    0.596     2.98    1.56    0.021      0.216        0.763
BIC                    0.208     2.95    0.22    0.053      0.800        0.147

Example 3 (Poisson log-linear regression). In this example, we considered
a Poisson regression model

\[ Y \mid x \sim \text{Poisson}\{\lambda(x^T \beta)\}, \]

where $\lambda(u) = \exp(u)$,
$\beta = (1.2, 0.6, 0, 0, 0.8, 0, 0, 0, 0, 0, 0, 0)^T$ and $x$ is the same as
that of Example 1. We let the sample size be 60 and 120. For each case we
simulated 1,000 data sets. Note that the model error is
$\mathrm{ME}(\hat{\beta}) = E\{\exp(x^T \hat{\beta}) - \exp(x^T \beta)\}^2$.
Since $x$ is normally distributed, we can derive a closed form for the model
error using the moment generating function of the normal distribution.
Simulation results are summarized in Table 3, in which notation is the same
as that in Table 2.
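The closed form mentioned above can be written out explicitly. For $x \sim N(0, \Sigma)$ the moment generating function is $E\exp(a^T x) = \exp(a^T \Sigma a / 2)$, so expanding the square gives $\mathrm{ME}(\hat{\beta}) = \exp(2\hat{\beta}^T\Sigma\hat{\beta}) - 2\exp\{\tfrac{1}{2}(\hat{\beta}+\beta)^T\Sigma(\hat{\beta}+\beta)\} + \exp(2\beta^T\Sigma\beta)$. The short sketch below evaluates this expression; the function name is hypothetical, and the derivation is a reconstruction from the stated model rather than a formula quoted from the text.

```python
import numpy as np


def poisson_model_error(beta_hat, beta, sigma):
    """Closed-form E{exp(x'beta_hat) - exp(x'beta)}^2 for x ~ N(0, sigma)."""
    def mgf(a):
        # moment generating function of N(0, sigma) evaluated at a
        return np.exp(0.5 * a @ sigma @ a)

    beta_hat = np.asarray(beta_hat, dtype=float)
    beta = np.asarray(beta, dtype=float)
    return float(mgf(2 * beta_hat) - 2 * mgf(beta_hat + beta) + mgf(2 * beta))
```

The three terms are the second moment of $\exp(x^T\hat{\beta})$, twice the cross moment, and the second moment of $\exp(x^T\beta)$.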

From Table 3, we can see that the one-step SCAD sparse estimate outperforms
the SCAD using both the original LQA algorithm and the perturbed LQA
algorithm in terms of model error, model complexity and the rate of
correct-fit. The best subset variable selection with the BIC has the best
rate of correct-fit for both $n = 60$ and 120. The correct-fit rate of
one-step sparse estimates becomes much higher when the sample size increases
from 60 to 120. This is not the case for SCAD, P-SCAD and the best subset
variable selection procedures.

Table 3
Simulation results for Poisson regression models

                                 No. of Zeros       Proportion of
Method                 MRME      C       IC      Under-fit  Correct-fit  Over-fit

n = 60
One-step SCAD          0.284     2.99    1.35    0.011      0.386        0.603
One-step LOG           0.260     2.99    1.10    0.006      0.460        0.534
One-step $L_{0.01}$    0.260     2.99    1.10    0.006      0.460        0.534
SCAD                   0.292     3.00    2.75    0.003      0.095        0.902
P-SCAD                 0.327     2.91    1.72    0.055      0.270        0.675
AIC                    0.496     3.00    1.40    0.001      0.265        0.734
BIC                    0.228     3.00    0.34    0.002      0.735        0.263

n = 120
One-step SCAD          0.271     3.00    1.00    0.001      0.552        0.447
One-step LOG           0.266     3.00    0.76    0.000      0.603        0.397
One-step $L_{0.01}$    0.266     3.00    0.77    0.000      0.601        0.399
SCAD                   0.342     3.00    2.36    0.000      0.174        0.826
P-SCAD                 0.356     2.95    1.60    0.037      0.322        0.641
AIC                    0.594     3.00    1.45    0.000      0.235        0.765
BIC                    0.277     3.00    0.25    0.000      0.790        0.210

Example 4 (Data analysis). In this example, we demonstrate our one-step
estimation methodology using the burns data, collected by the General
Hospital Burn Center at the University of Southern California. The data set
consists of 981 observations. Fan and Li [10] analyzed this data set as an
illustration of the nonconcave penalized likelihood methods. As in Fan and
Li [10], the binary response variable is taken to be the indicator of
whether the victims survived their burns. Four covariates, $x_1$ = age,
$x_2$ = sex, $x_3$ = log(burn area + 1) and the binary variable
$x_4$ = oxygen (0 = normal, 1 = abnormal), are considered. To reduce
modeling bias, quadratic terms of $x_1$ and $x_3$ and all interaction terms
were included in the logistic regression model. We computed the one-step
estimators with the SCAD and logarithm penalties. The regularization
parameter was chosen by 5-fold cross-validation. The logarithm of the
selected $\lambda$ equals $-0.356$ and $-7.095$ for the one-step estimates
with the SCAD and logarithm penalties, respectively.

With the selected regularization parameter, the fitted one-step SCAD
sparse estimate yields the following model

\[ \text{logit}\{P(Y = 1 \mid x)\} = 4.82 - 8.74 x_1 - 4.79 x_3^2 + 6.67 x_1 x_3, \tag{5.1} \]

where $Y = 1$ indicates that the victim survived his or her burns. This
model indicates that only $x_1$ and $x_3$ are significant. These are the
same variables as those in the model selected by the SCAD with the LQA
algorithm and reported in Fan and Li [10]. The one-step fit with logarithm
penalty is

\[ \text{logit}\{P(Y = 1 \mid x)\} = 4.55 - 6.45 x_1 - 0.29 x_4 - 0.56 x_1^2 - 4.21 x_3^2 + 5.21 x_1 x_3 - 0.15 x_2 x_3. \tag{5.2} \]

It selects more variables than (5.1). This is consistent with Table 2, from
which we can see that the one-step fit with logarithm penalty has a higher
rate of "over-fit" than the one-step SCAD estimator. The one-step
$L_{0.01}$ fit is almost identical to (5.2).

6. Proofs. 

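6.1. Proof of Theorem 1. At the $k$th step, define a function with parameter
$\beta^{(k)}$ as follows:

\[ G(\beta \mid \beta^{(k)}) = \ell(\beta) - n \sum_{j=1}^{p} \left[ p_\lambda(|\beta_j^{(k)}|) + p'_\lambda(|\beta_j^{(k)}|)(|\beta_j| - |\beta_j^{(k)}|) \right]. \]

Observe that $Q(\beta^{(k)}) = G(\beta^{(k)} \mid \beta^{(k)})$, and

\[ Q(\beta) - G(\beta \mid \beta^{(k)}) = n \sum_{j=1}^{p} \left\{ p_\lambda(|\beta_j^{(k)}|) + p'_\lambda(|\beta_j^{(k)}|)(|\beta_j| - |\beta_j^{(k)}|) - p_\lambda(|\beta_j|) \right\}. \]

By the concavity of the penalty function $p_\lambda(\cdot)$, we have

\[ p_\lambda(|\beta_j^{(k)}|) + p'_\lambda(|\beta_j^{(k)}|)(|\beta_j| - |\beta_j^{(k)}|) - p_\lambda(|\beta_j|) \ge 0. \]

If $\beta_j^{(k)} = 0$ we use the right derivative. Thus it follows that

\[ Q(\beta) \ge G(\beta \mid \beta^{(k)}). \]

We can take ">" in the above inequality if $p_\lambda(\cdot)$ is strictly
concave. Moreover, it is easy to check that

\[ \beta^{(k+1)} = \arg\max_{\beta} G(\beta \mid \beta^{(k)}). \]

Hence we have that

\[ Q(\beta^{(k+1)}) \ge G(\beta^{(k+1)} \mid \beta^{(k)}) \ge G(\beta^{(k)} \mid \beta^{(k)}) = Q(\beta^{(k)}). \]

This completes the proof.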
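The key inequality in this proof, that the tangent line of a concave penalty lies above the penalty, can be checked numerically for the SCAD penalty. The sketch below is only a sanity check on a grid, not part of the argument; the function names and grids are assumptions, and the SCAD formulas are the standard ones with $a = 3.7$.

```python
import numpy as np


def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty p_lam(theta), evaluated at |theta|."""
    theta = np.abs(np.asarray(theta, dtype=float))
    middle = -(theta ** 2 - 2 * a * lam * theta + lam ** 2) / (2 * (a - 1))
    flat = (a + 1) * lam ** 2 / 2
    return np.where(theta <= lam, lam * theta,
                    np.where(theta <= a * lam, middle, flat))


def scad_derivative(theta, lam, a=3.7):
    """SCAD derivative p'_lam(theta); the right derivative is used at 0."""
    theta = np.abs(np.asarray(theta, dtype=float))
    return np.where(theta <= lam, lam,
                    np.maximum(a * lam - theta, 0.0) / (a - 1.0))


def lla_gap(t, t0, lam, a=3.7):
    """Tangent-line value at t (expanded around t0) minus the penalty at t,
    for t, t0 >= 0. Concavity implies this is nonnegative."""
    tangent = scad_penalty(t0, lam, a) + scad_derivative(t0, lam, a) * (t - t0)
    return tangent - scad_penalty(t, lam, a)
```

Summing `lla_gap(abs(beta_j), abs(beta_j_k), lam)` over coordinates and multiplying by $n$ gives $Q(\beta) - G(\beta \mid \beta^{(k)})$, which is why the gap being nonnegative yields the ascent property.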

6.2. Proof of Theorem 2. Without loss of generality let us consider
$t > t_0$. It suffices to show

\[ \frac{\psi(t) - \psi^*(t)}{t - t_0} \ge 0. \tag{6.1} \]

Note that

\[ \psi(t) - \psi^*(t) = \psi(t) - \psi(t_0) - p'_\lambda(t_0)(t - t_0). \]

Thus (6.1) is equivalent to

\[ \frac{\psi(t) - \psi(t_0)}{t - t_0} \ge p'_\lambda(t_0). \tag{6.2} \]

Take a sequence $\{t_k\}$ such that $t_0 < t_k < t$ and $t_k \to t_0$. By
the convexity of $\psi(\cdot)$, we know

\[ \frac{\psi(t) - \psi(t_0)}{t - t_0} \ge \frac{\psi(t_k) - \psi(t_0)}{t_k - t_0} \quad \forall k. \tag{6.3} \]

Since $\psi(\cdot)$ is a majorization of $p_\lambda(\cdot)$ at $t_0$, we
have

\[ \frac{\psi(t_k) - \psi(t_0)}{t_k - t_0} \ge \frac{p_\lambda(t_k) - p_\lambda(t_0)}{t_k - t_0}. \tag{6.4} \]

Thus combining (6.3) and (6.4), we know

\[ \frac{\psi(t) - \psi(t_0)}{t - t_0} \ge \frac{p_\lambda(t_k) - p_\lambda(t_0)}{t_k - t_0} \quad \forall k. \]

Taking the limit in the above inequality we obtain (6.2). Similar arguments
can be applied to the case of $t < t_0$.
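As a worked instance of this result, consider the quadratic majorizer used by the LQA algorithm. The display below assumes the usual LQA form around a point $t_0 \neq 0$ and writes $\psi^*$ for the LLA approximation; both expressions are restated here as assumptions rather than quoted from the theorem.

```latex
\begin{aligned}
\psi_{\mathrm{LQA}}(t) &= p_\lambda(|t_0|) + \frac{p'_\lambda(|t_0|)}{2|t_0|}\,(t^2 - t_0^2), \\
\psi^*(t) &= p_\lambda(|t_0|) + p'_\lambda(|t_0|)\,(|t| - |t_0|), \\
\psi_{\mathrm{LQA}}(t) - \psi^*(t)
  &= \frac{p'_\lambda(|t_0|)}{2|t_0|}\,\bigl(t^2 - 2|t_0||t| + t_0^2\bigr)
   = \frac{p'_\lambda(|t_0|)}{2|t_0|}\,\bigl(|t| - |t_0|\bigr)^2 \;\ge\; 0 .
\end{aligned}
```

The difference is nonnegative whenever $p'_\lambda(|t_0|) \ge 0$ and vanishes only at $|t| = |t_0|$ when $p'_\lambda(|t_0|) > 0$, so under these assumptions the quadratic majorizer lies on or above the LLA approximation everywhere.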
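
6.3. Proof of Theorem 3. It suffices to show that

\[ E\left[ \frac{1}{\tau_j} \,\Big|\, \beta, y \right] = n p'_\lambda(|\beta_j|). \tag{6.5} \]

Then (2.13) is equivalent to (2.7), which in turn shows that LLA is identical
to the EM algorithm.

By $P(\tau_j \mid \beta, y) \propto p(\beta_j \mid \tau_j)\pi(\tau_j)$, we have

\[ E\left[ \frac{1}{\tau_j} \,\Big|\, \beta, y \right] = \frac{\int_0^\infty (1/\tau_j)\, p(\beta_j \mid \tau_j)\pi(\tau_j)\, d\tau_j}{\int_0^\infty p(\beta_j \mid \tau_j)\pi(\tau_j)\, d\tau_j}, \]

and (2.11) and (2.12) yield

\[ \frac{\int_0^\infty (1/\tau_j)\, p(\beta_j \mid \tau_j)\pi(\tau_j)\, d\tau_j}{\int_0^\infty p(\beta_j \mid \tau_j)\pi(\tau_j)\, d\tau_j} = -\frac{d \log(\exp(-n p_\lambda(|\beta_j|)))}{d|\beta_j|} = n p'_\lambda(|\beta_j|). \]

Hence (6.5) is proven.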

By the nonnegativity of $H(t)$, it is easy to see that
$\exp(-n p_\lambda(|\beta|))$ is a strictly decreasing function of
$|\beta|$, thus $p_\lambda(\cdot)$ is strictly increasing. To show that
$p_\lambda(\cdot)$ is unbounded, using the dominated (or monotone)
convergence theorem, we have $\exp(-n p_\lambda(|\beta|)) \to 0$ as
$|\beta| \to \infty$. Hence $p_\lambda(\cdot)$ is unbounded.



6.4. Proof of Theorem 4 and Theorem 5. Theorem 4 can be proven by 
the same proof for Theorem 5, and therefore, we only prove Theorem 5. 
Let us define 

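\[ V_n(u) = \frac{1}{2} \left( \frac{u}{\sqrt{n}} + \beta_0 - \beta^{(0)} \right)^T [-\nabla^2 \ell(\beta^{(0)})] \left( \frac{u}{\sqrt{n}} + \beta_0 - \beta^{(0)} \right) + n \sum_{j=1}^{p} p'_{\lambda_n}(|\beta_j^{(0)}|) \left| \beta_{0j} + \frac{u_j}{\sqrt{n}} \right|. \]

Then

\[ V_n(u) - V_n(0) = \frac{1}{2} u^T \left[ -\frac{\nabla^2 \ell(\beta^{(0)})}{n} \right] u + (\beta_0 - \beta^{(0)})^T [-\nabla^2 \ell(\beta^{(0)})] \frac{u}{\sqrt{n}} + n \sum_{j=1}^{p} p'_{\lambda_n}(|\beta_j^{(0)}|) \left( \left| \beta_{0j} + \frac{u_j}{\sqrt{n}} \right| - |\beta_{0j}| \right) \equiv T_1 + T_2 + T_3. \]

Let $\hat{u}^{(n)} = \arg\min_u [V_n(u) - V_n(0)]$; then
$\hat{\beta}(\mathrm{ose}) = \beta_0 + \hat{u}^{(n)}/\sqrt{n}$.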
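By Slutsky's theorem, it follows that

\[ T_1 \to_p \tfrac{1}{2} u^T I(\beta_0) u, \tag{6.6} \]

\[ T_2 = (\beta_0 - \beta^{(0)})^T [-\nabla^2 \ell(\beta^{(0)})] \frac{u}{\sqrt{n}} \to_d -u^T W, \tag{6.7} \]

where $W = N(0, I(\beta_0))$. We can write $T_3$ as

\[ T_3 = \sum_{j=1}^{p} T_{3j} = \sum_{j=1}^{p} \sqrt{n}\, p'_{\lambda_n}(|\beta_j^{(0)}|)\, \sqrt{n} \left( \left| \beta_{0j} + \frac{u_j}{\sqrt{n}} \right| - |\beta_{0j}| \right). \]

Note that

\[ \sqrt{n} \left( \left| \beta_{0j} + \frac{u_j}{\sqrt{n}} \right| - |\beta_{0j}| \right) \to \mathrm{sign}(\beta_{0j})\, u_j\, I(\beta_{0j} \neq 0) + |u_j|\, I(\beta_{0j} = 0). \]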



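We now examine the behavior of $\sqrt{n}\, p'_{\lambda_n}(|\beta_j^{(0)}|)$.
First consider the case where
$p'_{\lambda_n}(|\beta_j^{(0)}|) = \lambda_n p'(|\beta_j^{(0)}|)$. When
$\beta_{0j} \neq 0$, since $|\beta_j^{(0)}| \to_p |\beta_{0j}|$, the
continuous mapping theorem says that
$p'(|\beta_j^{(0)}|) \to_p p'(|\beta_{0j}|)$. Hence
$\sqrt{n}\lambda_n \to 0$ yields $T_{3j} \to_p 0$. When $\beta_{0j} = 0$,
$T_{3j} = 0$ if $u_j = 0$. For $u_j \neq 0$, we have

\[ T_{3j} = |u_j| \sqrt{n}\, \lambda_n\, p'(|\beta_j^{(0)}|) = |u_j|\, n^{(1+s)/2} \lambda_n\, (|\sqrt{n} \beta_j^{(0)}|)^{-s}\, \frac{p'(|\beta_j^{(0)}|)}{|\beta_j^{(0)}|^{-s}}. \]

By $\sqrt{n} \beta_j^{(0)} \to_d N(0, I^{-1}_{jj}(\beta_0))$, then from
$n^{(1+s)/2} \lambda_n \to \infty$ we see $T_{3j} \to_p \infty$.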



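For the SCAD penalty we have similar conclusions. $p'_\lambda(\theta) = 0$
if $\theta > a\lambda_n$ ($a = 3.7$). Thus, when $\beta_{0j} \neq 0$,
$|\beta_j^{(0)}| \to_p |\beta_{0j}| > 0$, then $\lambda_n \to 0$ ensures
$T_{3j} = \mathrm{sign}(\beta_{0j})\, u_j \sqrt{n}\, p'_{\lambda_n}(|\beta_j^{(0)}|) \to_p 0$.
When $\beta_{0j} = 0$, $T_{3j} = 0$ if $u_j = 0$. For $u_j \neq 0$, we have
$|\beta_j^{(0)}| = O_p(1/\sqrt{n})$. Also note that
$p'_\lambda(\theta) = \lambda_n$ for all $0 < \theta < \lambda_n$, which
implies that if $\sqrt{n}\lambda_n \to \infty$,
$T_{3j} = \sqrt{n}\, p'_\lambda(|\beta_j^{(0)}|)|u_j| = |u_j| \sqrt{n}\lambda_n$
with probability tending to one. Thus $T_{3j} \to_p \infty$.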
Let us write $u = (u_{10}^T, u_{20}^T)^T$. Then we have

\[ T_3 \to_p \begin{cases} 0, & \text{if } u_{20} = 0, \\ \infty, & \text{otherwise.} \end{cases} \tag{6.8} \]

Denote $W = (W_{10}^T, W_{20}^T)^T$. Combining (6.6), (6.7) and (6.8) we
conclude that for each fixed $u$,

\[ V_n(u) - V_n(0) \to_d V(u) = \begin{cases} \tfrac{1}{2} u_{10}^T I_1(\beta_{10}) u_{10} - W_{10}^T u_{10}, & \text{if } u_{20} = 0, \\ \infty, & \text{otherwise.} \end{cases} \]

The unique minimum of $V(u)$ is $u_{10} = I_1^{-1}(\beta_{10}) W_{10}$ and
$u_{20} = 0$. $V_n(u) - V_n(0)$ is a convex function of $u$. By
epiconvergence (Geyer [18] and Knight and Fu [21]), we conclude that

\[ \hat{u}^{(n)}_{10} \to_d I_1^{-1}(\beta_{10}) W_{10}, \tag{6.9} \]

\[ \hat{u}^{(n)}_{20} \to_d 0. \tag{6.10} \]

By $W_{10} = N(0, I_1(\beta_{10}))$, (6.9) is equivalent to

\[ \sqrt{n}\,(\hat{\beta}(\mathrm{ose})_1 - \beta_{10}) \to_d N(0, I_1^{-1}(\beta_{10})). \]

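Note that (6.10) implies that $\sqrt{n}\, \hat{\beta}(\mathrm{ose})_2 \to_p 0$.
We now show that with probability tending to one,
$\hat{\beta}(\mathrm{ose})_2 = 0$. This is a stronger statement than (6.10).
It suffices to prove that if $\beta_{0j} = 0$,
$P(\hat{\beta}_j(\mathrm{ose}) \neq 0) \to 0$. Assume
$\hat{\beta}_j(\mathrm{ose}) \neq 0$. By the KKT conditions of (3.3), we
must have

\[ \frac{1}{\sqrt{n}} \left| \left( [-\nabla^2 \ell(\beta^{(0)})] (\hat{\beta}(\mathrm{ose}) - \beta^{(0)}) \right)_j \right| = \sqrt{n}\, p'_{\lambda_n}(|\beta_j^{(0)}|). \tag{6.11} \]

We have shown that when $\beta_{0j} = 0$, the right-hand side goes to
$\infty$ in probability. However, the left-hand side can be written as

\[ \left| \left( \left[ -\frac{\nabla^2 \ell(\beta^{(0)})}{n} \right] \sqrt{n}\,(\hat{\beta}(\mathrm{ose}) - \beta_0) \right)_j - \left( \left[ -\frac{\nabla^2 \ell(\beta^{(0)})}{n} \right] \sqrt{n}\,(\beta^{(0)} - \beta_0) \right)_j \right|. \]

By (6.9) and (6.10), we know the first term converges in law to some normal,
and so does the second term. Thus

\[ P(\hat{\beta}_j(\mathrm{ose}) \neq 0) \le P(\text{KKT condition (6.11) holds}) \to 0. \]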

7. Discussion. In this article, we have proposed a new algorithm based
on the LLA for maximizing the nonconcave penalized likelihood. We further
suggest using the one-step LLA estimator as the final estimate, because the
one-step estimator naturally adopts a sparse representation and enjoys the
oracle properties. In addition, the one-step sparse estimate can
dramatically reduce the computational cost relative to the fully iterative
methods. The simulation shows that one-step sparse estimates have very
competitive performance with finite samples.

We have concentrated on the one-step sparse estimate for linear models
and likelihood-based models, including generalized linear models. The
proposed one-step sparse estimation method can be easily extended to
variable selection in survival data analysis using penalized partial
likelihood (Fan and Li [11] and Cai et al. [5]), variable selection for
longitudinal data (Fan and Li [12]) and variable selection in semiparametric
regression modeling (Li and Liang [26]).